AITopics | comprehensive assessment

Collaborating Authors

comprehensive assessment

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Neural Information Processing SystemsDec-25-2025, 16:58:37 GMT

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications to healthcare and finance - where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives - including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially due to the reason that GPT-4 follows the (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.

comprehensive assessment, decodingtrust, trustworthiness, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image Dataset

Pan, Liangrui, Li, Xingchen, Chen, Zhongyi, Chu, Ling, Peng, Shaoliang

arXiv.org Artificial IntelligenceJun-5-2025

Pathologists comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.

artificial intelligence, assessment, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2506.03185

Country:

North America > United States (0.30)
Asia > China > Hunan Province (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Hepatology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Add feedback

A System for Comprehensive Assessment of RAG Frameworks

Rengo, Mattia, Beadini, Senad, Alfano, Domenico, Abbruzzese, Roberto

arXiv.org Artificial IntelligenceApr-11-2025

--Retrieval Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. T o address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling a limited-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Using the REST APIs interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available at GitHub repository.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.07803

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Neural Information Processing SystemsJan-18-2025, 21:02:56 GMT

comprehensive assessment, decodingtrust, trustworthiness, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Towards a vision foundation model for comprehensive assessment of Cardiac MRI

Jacob, Athira J, Borgohain, Indraneel, Chitiboi, Teodora, Sharma, Puneet, Comaniciu, Dorin, Rueckert, Daniel

arXiv.org Artificial IntelligenceOct-6-2024

Cardiac magnetic resonance imaging (CMR), considered the gold standard for noninvasive cardiac assessment, is a diverse and complex modality requiring a wide variety of image processing tasks for comprehensive assessment of cardiac morphology and function. Advances in deep learning have enabled the development of state-of-the-art (SoTA) models for these tasks. However, model training is challenging due to data and label scarcity, especially in the less common imaging sequences. Moreover, each model is often trained for a specific task, with no connection between related tasks. In this work, we introduce a vision foundation model trained for CMR assessment, that is trained in a self-supervised fashion on 36 million CMR images. We then finetune the model in supervised way for 9 clinical tasks typical to a CMR workflow, across classification, segmentation, landmark localization, and pathology detection. We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes. We also demonstrate improved few-shot learning with fewer labeled samples, a common challenge in medical image analyses. We achieve an out-of-box performance comparable to SoTA for most clinical tasks. The proposed method thus presents a resource-efficient, unified framework for CMR assessment, with the potential to accelerate the development of deep learning-based solutions for image analysis tasks, even with few annotated data available.

classification, dataset, segmentation, (14 more...)

arXiv.org Artificial Intelligence

2410.01665

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Interview with Bo Li: A comprehensive assessment of trustworthiness in GPT models

AIHubJan-24-2024, 10:01:37 GMT

Bo Li and colleagues won an outstanding datasets and benchmark track award at NeurIPS 2023 for their work DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In this interview, Bo tells us about the research, the team's methodology, and key findings. We focus on assessing the safety and risks of foundation models. In particular, we provide the first comprehensive trustworthiness evaluation platform for large language models (LLMs). Given the wide adoption of LLMs, it is critical to understand their safety and risks in different scenarios before large deployments in the real world.

large language model, machine learning, natural language, (18 more...)

AIHub

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

Add feedback

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Wang, Boxin, Chen, Weixin, Pei, Hengzhi, Xie, Chulin, Kang, Mintong, Zhang, Chenhui, Xu, Chejian, Xiong, Zidi, Dutta, Ritik, Schaeffer, Rylan, Truong, Sang T., Arora, Simran, Mazeika, Mantas, Hendrycks, Dan, Lin, Zinan, Cheng, Yu, Koyejo, Sanmi, Song, Dawn, Li, Bo

arXiv.org Artificial IntelligenceJan-5-2024

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2.

comprehensive trustworthiness evaluation, differential privacy, semantic invariant style transformation, (16 more...)

arXiv.org Artificial Intelligence

2306.11698

Country:

North America > United States > Washington > King County > Seattle (0.13)
North America > United States > Illinois (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(27 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)
Research Report > Experimental Study (0.67)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Law (1.00)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluating Large Language Models: A Comprehensive Survey

Guo, Zishan, Jin, Renren, Liu, Chuang, Huang, Yufei, Shi, Dan, Supryadi, null, Yu, Linhao, Liu, Yan, Li, Jiaxuan, Xiong, Bojian, Xiong, Deyi

arXiv.org Artificial IntelligenceNov-25-2023

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interests in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers has been publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

artificial intelligence conference, eleventh international conference, thirty-second innovative application, (17 more...)

arXiv.org Artificial Intelligence

2310.19736

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Texas > Travis County > Austin (0.14)
(55 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Research Report > Experimental Study (0.92)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Law (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Deep learning for next-generation sleep diagnostics

#artificialintelligenceSep-26-2020, 19:39:46 GMT

Currently, the diagnosis of sleep disorders relies on polysomnographic recordings with a time-consuming manual analysis with low reliability between different manual scorers. Throughout the night, sleep stages are identified manually in non-overlapping 30-second epochs starting from the onset of the recording based on electroencephalography (EEG), electro-oculography (EOG), and chin electromyography (EMG) signals which require meticulous placement of electrodes. Moreover, the diagnosis of many sleep disorders relies on outdated guidelines. When assessing the severity of obstructive sleep apnea (OSA), the patients are classified based on thresholds of the apnea-hypopnea index (AHI), i.e. the number of respiratory disruptions during sleep. These thresholds are not fully based on solid scientific evidence and remain the same across different measurement techniques.

artificial intelligence, machine learning, sleep staging, (13 more...)

#artificialintelligence

Country:

Europe > Spain (0.05)
Europe > Finland > Northern Savo > Kuopio (0.05)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.74)
Health & Medicine > Diagnostic Medicine > Imaging (0.74)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

[24]7.ai Earns Top Score in Opus Research's Decision Makers' Guide to Enterprise Intelligent Assistants Report 2019 Edition Markets Insider

#artificialintelligenceDec-18-2019, 19:47:13 GMT

The 2019 edition of Opus Research's Decision Makers' Guide to Enterprise Intelligent Assistants report determined [24]7 AIVA to be a top solution for enterprises, and the only virtual agent solution capable of delivering across a breadth of simple FAQs to complex, conversational issues to online transactions. The Opus report presents a comprehensive assessment of 16 enterprise-grade Intelligent Assistant solution providers, with a focus on natural language processing, machine learning, AI, analytics and customer management integration to power digital self-service solutions. The report highlights [24]7 AIVA's ability to support both voice and digital channels and deliver unified self-service, calling out the company's differentiators as being a unique blend of AI and human insights, two decades of unparalleled experience in customer journeys across all channels, and proprietary insights including more than 150 patents and patent applications. "We analyzed a short-list of the leading providers in natural language processing, machine learning, AI and analytics to develop the industry's most comprehensive assessment of today's virtual agents and digital self-service solutions," said Dan Miller, lead analyst, Opus Research. An agent can take over a bot conversation at any time, and hand the conversation back to the bot to complete the interactions.

decision maker, enterprise intelligent assistant report 2019, opus research, (9 more...)

#artificialintelligence

Country: North America > United States > California > Santa Clara County > San Jose (0.06)

Genre: Press Release (0.53)

Industry:

Law > Intellectual Property & Technology Law (0.60)
Banking & Finance > Trading (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.99)

Add feedback